Optical Character Recognition (OCR) for copying text from pdf files
This week at work I was about to key chunks of information into my Excel spreadsheet. I usually update the spreadsheet with what I read in books, journals and trade magazines, so that I can retrieve information about a flavor chemical just by looking up its CAS number. I learnt to do this the hard way. When I first started my job, I was sending reports on the flavor composition of different foods to colleagues who were not familiar with flavors. I was not much better myself: I had only just started, and simply typing in the chemical names without typos was already a learning task. So that everyone would know what each flavor chemical was, I copied and pasted the descriptions individually for the 50 or 60-plus flavor chemicals every time I had to share a report. I spent far more time copying and pasting than actually interpreting the report and familiarising myself with the chemicals!
Finally, I realised I should have an Excel spreadsheet to store the information. My boss is very experienced in this field and has already committed the different chemicals and their properties to memory, but I was just starting out and had few bytes in my memory.
I built up my database by manually typing in what I came across, until I learnt that Flavorbase could be exported in Excel format. I joined the two databases together, and later my colleagues in other locations sent me their databases and I merged those in as well.
Now I want to add the information from Fenaroli’s handbook of flavor ingredients to my database. Should I manually type in over 2000 pages of information, or is there a better way?
A search on Google resulted in a revelation: R has a package called tesseract that does OCR, which opens the door to text mining! I had used tabulizer before, but there were some hits and misses, and I often had to do a lot of cleaning before the data could be used.
I was looking at the vignette intro for the package and decided to try it out for myself. My text cleaning and string manipulation skills can be further improved, but I had a good first attempt at two of the pages!
However, I realised I had a problem extracting data stored in tables in the pdf file. This led me to the pdftools package. The problem with pdftools is that I could not split its text into sections the way I did with the tesseract output.
how to convert a pdf page to an image for tesseract, or use an image straight away, for text mining
how to extract the chemical name, CAS number and FEMA number using pdftools
how to merge the two tables, which hold different information
I followed the steps listed on: https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
I chose an entry that is printed in full on a single page. This is the ideal scenario, just to demonstrate that the approach will work for me.
A screenshot of the file is shown below.
Importing the file:
compound_png <- pdftools::pdf_convert("test.pdf", dpi = 600) # test.pdf should be saved in your working directory
Converting page 1 to test_1.png... done!
Use optical character recognition on the png file.
Concatenate and print (cat) the file.
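A minimal sketch of these two steps, following the tesseract vignette, might look like the chunk below. The wrapper name `ocr_png()` is made up for illustration, and `test_1.png` is the file written by `pdf_convert()` above. The result is stored in `text`, the name that the cleaning step later on expects. The call is guarded so that nothing happens if the package or the image is missing.

```r
# Hypothetical helper: run English OCR on one image file and return the text.
ocr_png <- function(png_path) {
  eng <- tesseract::tesseract("eng")      # English engine, as in the vignette
  tesseract::ocr(png_path, engine = eng)  # returns the recognised text
}

# Only run when the package and the image from pdf_convert() are available.
if (requireNamespace("tesseract", quietly = TRUE) && file.exists("test_1.png")) {
  text <- ocr_png("test_1.png")
  cat(text)  # concatenate and print, so line breaks are shown as line breaks
}
```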
ALLYL SORBATE 67
DFE
ALLYL PROPYL DISULFIDE
Synonyms: Allyl propyl disulphate; Disulfide, 2-propenyl propyl; Disulfide, allyl propyl; 2-Propenyl propyl disulfide; Propeny]
propyl disulfide; 4,5-Dithia-1-octene; Propyl allyl disulfide
[CoBNo.: [600 [EINECS No.: [218-550-7__[JECFANo.: [1700 |
Description: Colorless to yellowish liquid; fruity, garlic aroma.
Consumption: Odor and/or flavor used in cabbage, tropical fruit, garlic, leek, and onion Annual: n/a Individual: n/a
Regulatory Status:
CoE: n/a
FDA: n/a
FDA (other): n/a
JECFA: ADI: Acceptable. No safety concern at current levels of intake when used as a flavoring agent (2007).
Trade association guidelines: FEMA PADI 0.091 mg IOFI: n/a
Empirical Formula/MW:
C.H,,S,/148.29 ee NSN
Specifications: (JECFA, 2008)
Reported uses (ppm): (FEMA, 2005)
Synthesis: n/a
Aroma threshold values: High strength odor, sulfurous type; recommend smelling in a 0.10% solution or less.
Taste threshold values: Taste like that of cooked onions.
Natural occurrence: Reported as the chief volatile constituent in onion oil and found in raw cabbage, chive, garlic oil, leek and
onion.
DFE
ALLYL SORBATE
Synonyms: Allyl-2,4-hexadienoate; Allyl hexa-2,4-dieonoate; Allyl sorbate; 2-Propenyl sorbate; 2,4-Hexadienoic acid, 2-prope-
nyl ester, (E,E)-; (E, E)-2-Propenyl 2,4-hexadienoate; 2,4-hexadienoic acid, 2-propen-l-yl ester (2E, 4E)-
[CoE No.: [2182 [EINECS No.: [231-336-8 JECFANo: [8 |
Description: Allyl sorbate is a colorless liquid with a fruital pineapple-like odor.
Consumption: Annual: <1.00 Ib Individual: 0.00000061 mg/kg/day
library(tidyverse) # stringr, dplyr, tidyr, purrr and tibble are used below
library(naniar) # for replace_with_na_at()
# a word written entirely in capital letters
upper_case_pattern <- "\\b[A-Z]+\\b"
# compound name: from the first all-caps word up to and including "Synonyms"
caps_before_synon_pattern <- "\\b[A-Z]+\\b.+(?<=Synonyms)"
# CAS number: digits - digits - digits
cas <- "[[:digit:]]+-[[:digit:]]+-[[:digit:]]+"
# description: text between "Description: " and " Consumption"
description_pattern <- "(?<=Description: ).+(?= Consumption)"
# consumption: text between "Consumption: " and " Regulatory"
consumption_pattern <- "(?<=Consumption: ).+(?= Regulatory)"
# aroma: text between "values: " and " Taste threshold"
aroma_pattern <- "(?<=values: ).+(?= Taste threshold)"
# taste: text between "Taste threshold values: " and " Natural"
taste_pattern <- "(?<=Taste threshold values: ).+(?= Natural)"
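The `cas` pattern above accepts any three groups of digits, so it would also match the EINECS number in the OCR output (218-550-7). As a rough sketch, a CAS number can be pinned down more tightly: two to seven digits, two digits, then a single check digit, where the check digit equals the weighted sum of the other digits (weights 1, 2, 3, ... counted from the right) modulo 10. The names `cas_strict` and `cas_is_valid()` and the sample line are made up for this example, and base R is used so that it runs without extra packages.

```r
# Assumed stricter pattern: 2-7 digits, 2 digits, 1 check digit
cas_strict <- "\\b[0-9]{2,7}-[0-9]{2}-[0-9]\\b"

# Hypothetical helper: TRUE when the last digit matches the CAS check digit
cas_is_valid <- function(cas_no) {
  digits <- as.integer(strsplit(gsub("-", "", cas_no), "")[[1]])
  check  <- digits[length(digits)]
  body   <- rev(digits[-length(digits)])  # rightmost body digit gets weight 1
  sum(body * seq_along(body)) %% 10 == check
}

# Made-up line mixing an EINECS number with a CAS number
sample_line <- "EINECS No.: 218-550-7 CAS No.: 2179-59-1"
found <- regmatches(sample_line,
                    gregexpr(cas_strict, sample_line, perl = TRUE))[[1]]
found                        # only "2179-59-1"; the EINECS number is skipped
sapply(found, cas_is_valid)  # TRUE for 2179-59-1
```

A check like this can flag CAS numbers in which OCR has misread a digit before they go into the database.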
text_clean <- text %>%
  str_replace_all("\\n", " ") %>% # put the whole page on one line
  str_split_fixed(" DFE ", 4) %>% # split into sections by DFE
  as_tibble(.name_repair = "unique") %>%
  pivot_longer(everything()) %>%
  mutate(no_of_char = map_dbl(value, str_length)) %>%
  filter(no_of_char > 50) %>% # threshold to drop short fragments such as the page header
  mutate(compound = str_extract_all(value, caps_before_synon_pattern,
                                    simplify = TRUE),
         compound_clean = str_trim(str_replace_all(compound, "Synonyms", "")),
         description = str_extract_all(value, description_pattern,
                                       simplify = TRUE),
         consumption = str_extract_all(value, consumption_pattern,
                                       simplify = TRUE),
         aroma = str_extract_all(value, aroma_pattern,
                                 simplify = TRUE),
         taste = str_extract_all(value, taste_pattern,
                                 simplify = TRUE))
text_clean
# A tibble: 2 × 9
name value no_of_char compound[,1] compound_clean description[,1]
<chr> <chr> <dbl> <chr> <chr> <chr>
1 ...2 "ALLY… 1107 ALLYL PROPYL… ALLYL PROPYL … Colorless to y…
2 ...3 "ALLY… 449 ALLYL SORBAT… ALLYL SORBATE Allyl sorbate …
# … with 3 more variables: consumption <chr[,1]>, aroma <chr[,1]>,
# taste <chr[,1]>
# split the pdftools text into entries, then extract the compound name, CAS and FEMA numbers
table_pdf <- table_text %>%
  str_split_fixed("\n\n\n", 4) %>% # entries are separated by two blank lines
  as_tibble() %>%
  pivot_longer(everything()) %>%
  mutate(desc = str_squish(value)) %>%
  mutate(no_of_char = map_dbl(value, str_length)) %>%
  mutate(compound = str_extract_all(desc, caps_before_synon_pattern,
                                    simplify = TRUE),
         compound_clean = str_trim(str_replace_all(compound, "Synonyms", ""))) %>%
  select(-compound) %>%
  # the lookahead keeps the "CAS No.:" label in the match, so it is removed afterwards
  mutate(cas = str_extract_all(desc, "(?=CAS No.: ).+(?=FL No.)",
                               simplify = TRUE),
         cas_cleaned = str_trim(str_replace_all(cas, "CAS No.:", ""))) %>%
  select(-cas) %>%
  mutate(fema = str_extract_all(desc, "(?=FEMA No.: ).+(?= NAS)",
                                simplify = TRUE),
         fema_cleaned = str_trim(str_replace_all(fema,
                                                 "FEMA No.:", ""))) %>%
  select(-fema) %>%
  replace_with_na_at(.vars = c("cas_cleaned", "fema_cleaned"),
                     condition = ~.x == "") %>%
  # remove entries without cas number
  filter(!is.na(cas_cleaned))
table_pdf
# A tibble: 2 × 7
name value desc no_of_char compound_clean cas_cleaned fema_cleaned
<chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 V2 "\nA… ALLY… 1187 ALLYL PROPYL … 2179-59-1 4073
2 V4 "\nA… ALLY… 684 ALLYL SORBATE 7493-75-6 2041
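The CAS and FEMA patterns in the pipeline above start with a lookahead, `(?=CAS No.: )`, so the label stays in the match and has to be removed in a second step. A lookbehind, `(?<=...)`, leaves the label out from the start. The sketch below shows the difference on a made-up line. The layout of `sample_desc` and the helper name `extract_first()` are assumptions, while the CAS and FEMA values are the ones shown in the table above. Base R is used so that it runs without extra packages.

```r
# Hypothetical helper: first match of a Perl-style regex, or NA if none
extract_first <- function(x, pattern) {
  m <- regmatches(x, regexpr(pattern, x, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

# Made-up line that imitates one squished entry
sample_desc <- paste("ALLYL SORBATE Synonyms: Allyl sorbate",
                     "CAS No.: 7493-75-6 FL No.: n/a",
                     "FEMA No.: 2041 NAS No.: n/a")

# Lookahead: the label is part of the match
with_label <- extract_first(sample_desc, "(?=CAS No\\.: ).+?(?= FL No\\.)")
with_label

# Lookbehind: only the value is returned
cas_no  <- extract_first(sample_desc, "(?<=CAS No\\.: ).+?(?= FL No\\.)")
fema_no <- extract_first(sample_desc, "(?<=FEMA No\\.: ).+?(?= NAS)")
c(cas = cas_no, fema = fema_no)
```

Returning `NA` when nothing matches also removes the need to convert empty strings to `NA` afterwards.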
As I am unable to split the table_pdf by sections, I will only extract the compound name, CAS and FEMA from the table.
table_pdf
# A tibble: 2 × 7
name value desc no_of_char compound_clean cas_cleaned fema_cleaned
<chr> <chr> <chr> <dbl> <chr> <chr> <chr>
1 V2 "\nA… ALLY… 1187 ALLYL PROPYL … 2179-59-1 4073
2 V4 "\nA… ALLY… 684 ALLYL SORBATE 7493-75-6 2041
text_clean
# A tibble: 2 × 9
name value no_of_char compound[,1] compound_clean description[,1]
<chr> <chr> <dbl> <chr> <chr> <chr>
1 ...2 "ALLY… 1107 ALLYL PROPYL… ALLYL PROPYL … Colorless to y…
2 ...3 "ALLY… 449 ALLYL SORBAT… ALLYL SORBATE Allyl sorbate …
# … with 3 more variables: consumption <chr[,1]>, aroma <chr[,1]>,
# taste <chr[,1]>
merged <- text_clean %>%
  left_join(table_pdf, by = "compound_clean",
            suffix = c("_text", "_table")) %>% # name, value and no_of_char exist in both tables
  select(value_text, compound_clean, cas_cleaned, fema_cleaned,
         description, consumption, aroma, taste) %>%
  map_df(str_squish) # tidy up whitespace in every column
glimpse(merged)
Rows: 2
Columns: 8
$ value_text <chr> "ALLYL PROPYL DISULFIDE Synonyms: Allyl propy…
$ compound_clean <chr> "ALLYL PROPYL DISULFIDE", "ALLYL SORBATE"
$ cas_cleaned <chr> "2179-59-1", "7493-75-6"
$ fema_cleaned <chr> "4073", "2041"
$ description <chr> "Colorless to yellowish liquid; fruity, garli…
$ consumption <chr> "Odor and/or flavor used in cabbage, tropical…
$ aroma <chr> "High strength odor, sulfurous type; recommen…
$ taste <chr> "Taste like that of cooked onions.", ""
The ALLYL SORBATE entry is cut off at the end of the page, so its record is incomplete (its taste field is empty).
However, this is a great start for my scraping!
My next step is to scrape more pages and see how I can merge the data together.
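One way to scale this up, sketched here in base R, is to wrap the per-page logic in a function that takes the OCR text of one page and returns one row per entry, then apply it to every page and stack the results. The function name `parse_page()`, the helper `first_match()` and the two short sample pages are made up for illustration. Real pages would come from OCR.

```r
# Hypothetical helper: first regex match in x, or NA when nothing matches
first_match <- function(x, pattern) {
  m <- regmatches(x, regexpr(pattern, x, perl = TRUE))
  if (length(m) == 0) NA_character_ else m
}

# Hypothetical parser: one OCR'd page in, one data frame row per entry out
parse_page <- function(page_text) {
  one_line <- gsub("\\n", " ", page_text)
  sections <- strsplit(one_line, " DFE ", fixed = TRUE)[[1]]
  sections <- sections[nchar(sections) > 50]  # drop headers and fragments
  data.frame(
    compound    = trimws(sub("Synonyms.*$", "", sections)),
    description = vapply(sections, first_match, character(1),
                         pattern = "(?<=Description: ).+?(?= Consumption)",
                         USE.NAMES = FALSE),
    stringsAsFactors = FALSE
  )
}

# Two made-up pages standing in for OCR output
pages <- c(
  "HEADER 1\nDFE\nCOMPOUND A\nSynonyms: a1; a2\nDescription: Colorless liquid. Consumption: n/a",
  "HEADER 2\nDFE\nCOMPOUND B\nSynonyms: b1; b2\nDescription: White powder. Consumption: n/a"
)

# Parse every page and stack the rows into one table
all_entries <- do.call(rbind, lapply(pages, parse_page))
all_entries
```

An entry that runs over a page break, like ALLYL SORBATE here, would still come out incomplete with this approach, so the page texts may need to be pasted together before splitting by DFE.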
Code chunk for trying out, before adding to the final cleaning step:
# let me try on test text first
test_text <- text_clean %>%
  filter(no_of_char > 1000) %>%
  select(value) %>%
  pull()
test_text
[1] "ALLYL PROPYL DISULFIDE Synonyms: Allyl propyl disulphate; Disulfide, 2-propenyl propyl; Disulfide, allyl propyl; 2-Propenyl propyl disulfide; Propeny] propyl disulfide; 4,5-Dithia-1-octene; Propyl allyl disulfide [CoBNo.: [600 [EINECS No.: [218-550-7__[JECFANo.: [1700 | Description: Colorless to yellowish liquid; fruity, garlic aroma. Consumption: Odor and/or flavor used in cabbage, tropical fruit, garlic, leek, and onion Annual: n/a Individual: n/a Regulatory Status: CoE: n/a FDA: n/a FDA (other): n/a JECFA: ADI: Acceptable. No safety concern at current levels of intake when used as a flavoring agent (2007). Trade association guidelines: FEMA PADI 0.091 mg IOFI: n/a Empirical Formula/MW: C.H,,S,/148.29 ee NSN Specifications: (JECFA, 2008) Reported uses (ppm): (FEMA, 2005) Synthesis: n/a Aroma threshold values: High strength odor, sulfurous type; recommend smelling in a 0.10% solution or less. Taste threshold values: Taste like that of cooked onions. Natural occurrence: Reported as the chief volatile constituent in onion oil and found in raw cabbage, chive, garlic oil, leek and onion."
# To define different text patterns
str_extract_all(test_text,"(?<=Taste threshold values: ).+(?= Natural)")
[[1]]
[1] "Taste like that of cooked onions."
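The patterns here use a greedy `.+`, which runs to the last occurrence of the closing keyword in the string. That is fine when a string holds a single entry, but if two entries end up in one string the match would span both. A lazy `.+?` stops at the first occurrence instead. A short base R illustration on a made-up string (the sample text is an assumption, not taken from the handbook):

```r
# Made-up string holding two entries back to back
two_entries <- paste(
  "Taste threshold values: Cooked onion. Natural occurrence: Onion oil.",
  "Taste threshold values: Sweet, fruity. Natural occurrence: Pineapple."
)

greedy <- "(?<=Taste threshold values: ).+(?= Natural)"
lazy   <- "(?<=Taste threshold values: ).+?(?= Natural)"

# Greedy: one long match that runs into the second entry
greedy_hits <- regmatches(two_entries,
                          gregexpr(greedy, two_entries, perl = TRUE))[[1]]
# Lazy: one short match per entry
lazy_hits <- regmatches(two_entries,
                        gregexpr(lazy, two_entries, perl = TRUE))[[1]]

length(greedy_hits)
lazy_hits
```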
text_clean %>%
  filter(compound_clean == "ALLYL SORBATE") %>%
  select(value) %>%
  pull()
[1] "ALLYL SORBATE Synonyms: Allyl-2,4-hexadienoate; Allyl hexa-2,4-dieonoate; Allyl sorbate; 2-Propenyl sorbate; 2,4-Hexadienoic acid, 2-prope- nyl ester, (E,E)-; (E, E)-2-Propenyl 2,4-hexadienoate; 2,4-hexadienoic acid, 2-propen-l-yl ester (2E, 4E)- [CoE No.: [2182 [EINECS No.: [231-336-8 JECFANo: [8 | Description: Allyl sorbate is a colorless liquid with a fruital pineapple-like odor. Consumption: Annual: <1.00 Ib Individual: 0.00000061 mg/kg/day "
https://cran.r-project.org/web/packages/tesseract/vignettes/intro.html
Burdock, G. A., & Fenaroli, G. (2005). Fenaroli’s handbook of flavor ingredients. Boca Raton, FL: CRC Press.
For attribution, please cite this work as
lruolin (2021, Oct. 15). pRactice corner: Text mining from pdf files with Tesseract and pdftools. Retrieved from https://lruolin.github.io/myBlog/posts/20211015 Text mining with tesseract/
BibTeX citation
@misc{lruolin2021text, author = {lruolin, }, title = {pRactice corner: Text mining from pdf files with Tesseract and pdftools}, url = {https://lruolin.github.io/myBlog/posts/20211015 Text mining with tesseract/}, year = {2021} }